Formative assessment of practical skills with peer-assessors: quality features of an OSCE in general medicine at the Heidelberg Medical Faculty

Background: Objective Structured Clinical Examinations (OSCEs) have become an established examination format at German medical faculties. Medical experts routinely use a summative assessment to evaluate practical and communicative skills, while the use of the OSCE format by student examiners, as a formative examination, remains rather limited. Objective: The formative OSCE program of the Department of General Practice and Implementation Research at the Heidelberg Medical Faculty, which is conducted and evaluated by peer tutors, is examined with regard to its quality criteria and compared with summative OSCEs from other departments. Methods: Difficulties and discriminatory power of individual testing stations were determined for the summative, as well as the formative OSCE, and compared with each other. To assess the reliability of the measurements, an analysis of the data was carried out using the Generalizability theory. In addition, a comparison is made between the assessments of student examiners and second assessments by medical experts. Results: The stations of the formative OSCE show similar difficulties as those of the summative comparison OSCEs (Pform=0.882; Psum=0.845 – 0.902). With respect to measurement reliability, there are no differences between the OSCE in General Medicine and the other subjects. The assessments of student examiners and medical experts correlate highly (r=0.888). Conclusion: The formative OSCE in General Medicine is comparable to the summative comparison formats in terms of its quality criteria. The use of student examiners can be a reliable alternative to medical experts in formative OSCEs.


Introduction
Practical clinical skills and anamnesis are already being taught at various medical faculties in the preclinical study semesters and tested with the help of an Objective Structured Clinical Examination (OSCE). It has been shown that an early learning of practical skills leads to better results in the clinical examination sections and clinical skills [1]. Traditionally, the teaching content is taught by faculty physicians, but increasingly also by student tutors of higher semesters. An advantage of peer tutors (Peer Assisted Learning, PAL) is the higher acceptance by students [2], lower costs [3], [4] and the possibility of smaller learning groups [5]. In addition, the students benefit from a reduction of stress and anxiety factors [6] and the student tutors [2], [7] benefit from their own in-depth study of the learning content. When comparing the student tu-tors with faculty members, the PAL students achieve the same results in final exams [8], [9], [10], [11] and the same or even higher quality of feedback [10]. Prerequisites for this are precisely defined student tutor training courses and checklists [12], [13]. Since 2013, practical skills and anamnesis techniques have been taught at the Heidelberg Medical Faculty in the pre-clinical part of the AaLPLUS courses (AaL: "Living Anatomy Plus") of the Department of General Practice and Implementation Research with the help of peer tutors and subsequently examined in a formative OSCE, also conducted by student tutors [14]. A detailed description of the program and the evaluation of the OSCE by students and peer tutors can be found in [15]. Black and Wiliam [16] see five essential aspects of formative examinations. These are adapted to the context of University education: 1. Clarification and exchange of learning goals and success criteria 2. Initiating effective discussions and other learning tasks that demonstrate students' understanding of the learning content 3. Feedback that is useful for the students 4. Encouraging students to act as a mutual learning resource 5. Encouraging students to see themselves as initiators of their own learning activities These objectives involve a whole process of teaching in which more or less continuously formative examinations are integrated. This is often logistically difficult to achieve fully in formative practical examinations in the form of OSCEs in medical education, so that the formative OSCE considered here should rather be seen as an instrument [17], which comes at the end of the pre-clinical part of the study. In order to achieve the goals announced by Black and Wiliam, other forms of formative examination procedures should be suitable [18]. Despite this limited function of the formative OSCE, it can be expected to have a positive effect on the learning behaviour of the examined students [19], [20]. In a review article by Khan et al. from 2017, 13 publications on the topic of "Students as examiners in OSCEs" are presented in more detail [21]. Some of the papers listed there examine the assessments of students and experts with regard to basic characteristics such as differences in the scores awarded and the correlation of the assessments of students and experts as examiners. A more detailed quantitative analysis, which also includes a differentiation of station-and examiner effects and their consequences for measurement reliability, is only provided in the works of Moineau et al. [10] and Basehore et al. [22]. In both studies, double evaluations at the stations by students and experts are investigated (in [22] the experts evaluated using videos of the examinations). However, it was not investigated whether student examiners differ from experts with regard to the extent of exam effects.
Besides the comparison of student examiners and experts in the same formative examination, the quality of the formative examination in relation to the summative examinations established at the faculty is also of interest. Formative examinations differ in their objectives and structure (e.g. higher importance of feedback) and relevance of summative examinations to the examined students. The latter in particular, can have an effect on the reliability and accuracy of measurements, e.g. if the performance of the candidates is less differentiated due to reduced motivation.

Aim of the study
The aim of the study was to demonstrate 1. that students in the context of formative examinations of practical skills can replace experts as examiners without compromising the quality of the examination and 2. that the quality of such formative examinations reaches the same standards as established summative examinations.
To this end, the formative OSCE in General Medicine at the Heidelberg Medical Faculty, which was held in 2018 and involved tutors as examiners, was examined with regard to its quality criteria (characteristics of the stations, measurement reliability of the exam, extent of examiner effects). A comparison was made with summative OSCEs, and a matching between the assessments of student examiners and those of experts ("supervisors") was considered.
Other aspects of the formative OSCE in General Medicine with student examiners, such as acceptance by both examiners and examined students, assessment of the quality of feedback and subjective benefit to both students and examiners of the skills assessed in the OSCE are described in detail in [15]. The present study focuses exclusively on the quality characteristics of the OSCE that can be measured by statistical parameters of the examination results. Standard analyses of tests usually include basic parameters such as difficulty, selectivity and reliability (see 3.1.1). Based on the Generalizability theory, the facets (influencing factors) "students" (differences in the ability of students), "station" (difference in the difficulty of stations), "examiner" (difference in the "strictness" of examiners) and the interaction "station x examiner" (different strictness of examiners at different stations) and their effects on generalizability and absolute measurement accuracy (see 3

Implementation of the OSCE
The formative OSCE General Medicine in May 2018 was attended by 300 students of the fourth semester. The OSCE took place over two days and comprised four testing stations. One of the four stations ("venous blood sampling") was completed by all students. Various clinical examinations had to be performed at two stations. These stations were not identical for the participating students, but alternated between the different parcours. A total of 11 different tasks were used (general examination of the abdomen, examination of spleen/kidney/appendicitis signs, blood pressure measurement, examination of the heart, liver, lymph node status, pulse status, thyroid gland, thorax, spine and a neurological examination). Furthermore, a complete anamnesis had to be taken. Here too, the contents changed (back, abdomen and head). Trained acting patients were used for the clinical examinations and the anamnesis. The contents of the stations and the essential criteria for evaluation were known to the participating students from the previous tutorials and given materials. Each participant went through a total of four stations of eight minutes duration (5 minutes per task and 3 minutes feedback). The assessment of performance was carried out using checklists by students with basic didactic training who were at least in their sixth semester. A total of 25 points could be achieved at each of the stations. An exception to this were the three stations where an anamnesis had to be taken. In these, 30 points were to be achieved. 32 students were used as examiners, 26 of whom examined at several (up to five) stations during the course of the OSCE (see table 1). The assessments were recorded using tablet computers (Programm tOSCE des UCAN-Prüfungsverbunds) [23]. Five supervisors were appointed to monitor the quality of implementation and evaluation, who carried out random second evaluation (135 evaluations in total). The trained examiners were (medical) staff members of the Department of General Practice and Implementation Research and, for the assessment of communicative skills at the three anamnesis stations, lecturers of the Department of Medical Psychology.

Comparison with summative OSCEs
Six  (see  table 2). These OSCEs were performed on two to three days in two parallel courses (viz. "parcours"). The stations were partly changed in the different parcours. The two subjects Internal Medicine and Surgery were chosen because: 1. different examiners were used at the same testing stations and 2. the examiners were generally employed at different stations.
This allows an estimation of the examiner, stations and the interaction effect station x station during the evaluation.

Statistical analysis
Difficulty P and corrected selectivities r it (correlations of the number of points achieved at one station with the points achieved at all other stations) as well as the mean inter-correlations with all other stations r ij (mean interitem correlation) were determined for the stations of all mentioned OSCEs. The product-moment correlation (according to Pearson, two-tailed P value) was used throughout as a correlation measure. In order to achieve equivalence of the stations, the point values obtained at the anamnesis stations, where 30 points were to be achieved, were rescaled to the range of 0-25 points for all analyses presented.
To assess the reliability of the measurements, the data were analysed using the Generalisability theory [24]. The facets considered were "students", "stations", "examiners" and the interaction "station x examiner". From the variance components found by applying the Generalizability theory, the "generalizability" Eρ 2 (as an analogy to internal consistency/Cronbachs α) and the "dependability" Φ can be determined as a measure of absolute measurement accuracy: If n denotes the number of stations, then In order to analyse the matching between the assessments of the student examiners and the supervisors, the scores awarded for each station were compared (Wilcoxon signed-rank test) and the correlations determined. Furthermore, an analysis of variance of the total data set (examiners and supervisors) with the fixed factor "student examiner/supervisor" and the facets "students", "stations", "student examiners", "supervisor" and the interaction "station x examiner" was carried out. Note: When analysing with the Generalizability theory, a distinction must be made between so-called fixed and random factors. If the facet "student" is considered a random factor, the intention is to generalise to equivalent groups of students (i.e. in the same semester, same demographic composition, equivalent teaching etc.). The group of students considered in the examination being analysed should therefore be regarded as a sample from a population. The same applies to the facet "station": As a random factor, the focus is on generalizability to equivalently constructed stations, while the facet "examiners" involves examiners from a potential group of examiners. When modelling the station or examiner as a fixed factor, however, the focus is on the stations or examiners actually used in the exam: Are individual stations particularly easy or difficult, are examiners too strict or too lenient? Since the present study focuses on generalizability, only the results for the analyses with "student", "station" and "examiner" are presented as random factors. The statistical analyses were performed with R Version 3.5.1. For the mixed model analyses for evaluation with the model of generalizability theory the packages "lme4" and "lmerTest" were used. The basic parameters (mean score achieved x, difficulty P and corrected selectivity rit) of the scores obtained at the stations are listed in table 3. Figure 1 contains a graphical representation of the distributions as a box plot. The difficulties at the individual stations range from P=0.794 at the "Anamnesis Abdomen" station to P=0.959 at the "Blood Pressure Measurement" station. An average of 87.632 out of a maximum of 100 points was achieved. Please note that in contrast to dichotomous items, where only 0 or 1 point can be achieved, with finer granular evaluations (here 0-25 points) selectivities can possibly be interpreted even if the difficulties are numerically high. Eleven of the 15 stations have part-hole corrected selectivities of more than 0.300, two stations are just below this limit with selectivities of 0.276 and 0.296 ("Physical Examination Blood Pressure" and "Physical Examination Neurology"). Significantly lower are the stations "Physical Examination Liver" with r it =0.112 and "Pulse status" with r it =0.099.   figure 3). In this comparison, however, it must be taken into account that in the OSCE General Medicine, the point total of the other stations used for the corrected selectivity is determined from only three stations, in contrast to Internal Medicine and Surgery with nine and twelve stations, respectively. This means that this sum is subject to more error variance in the OSCE General Medicine. A better possibility for com-

Measurement reliability
Methods of Generalizability theory were used to analyse measurement reliability. A model with the facets "student", "station", "examiner" and the interaction "station x examiner" was analysed. Table 5 shows the estimated variance components of the facets.  Table 5: Variance components for the facets of the student, station, examiner and station x examiner. The standard deviation indicates the size of the influence of the respective effect in points at a station.  The total variance is divided into the components "student", "station", "examiner", the interaction "station x examiner" and the residual variance.
Nearly 53% of the variance can be explained by the effects of the model, with 22% attributable to differences between students in terms of performance. The variability of the stations accounts for 21%, while the combined examiner influence was around 10%. The interaction effect station x examiner was not detectable or significantly different from 0. The expected correlation of the point values achieved in the OSCE with an equivalent OSCE is Eρ 2 =0.647. These values do not take into account the effects of station and examiner, since in an equivalent parcours, all students pass through the same stations with the same examiners, so their total achieved points are only changed by these facets by a value that is constant for all and is not taken into account in a correlation (Eρ 2 is thus a measure of the relative measurement accuracy). In contrast, the Dependability Φ, as a measure of the absolute measurement accuracy, takes these factors into account, and is Φ=0.525 for the test.
Comparison with summative OSCEs Figure 5 shows graphically the percentage shares of the variance components for the OSCEs. A quality comparison of the OSCE General Medicine with those of Internal Medicine and Surgery with regard to the quality of the stations and the extent of the examiner's influences must take into account the different number of stations. As an example, table 6 lists the values obtained on a Parcour with ten stations. It can be seen that for Eρ 2 three of the  six comparison OSCEs have both lower and higher values. The absolute accuracy is higher for four comparison OSCEs. As can be seen in figure 5, this is mainly due to the higher variability of the stations.

Supervision
In 135 assessments, an additional examination was carried out by a supervisor (medical staff of the Department of General Practice and Implementation Research and Medical Psychology), which serves as quality assurance of the OSCE (see table 1). Table 7 shows the mean values of the assessments by the examiners, as well as those of the supervisors for the wards with double assessments. In addition, the significance value of the test for difference of assessments (Wilcoxon signed-rank test) is given. Only one station ("Anamnesis Abdomen") shows a statistically significant difference. Table 7 contains the correlations between examiners and supervisors at the stations, these ranged from 0.729 to 0.989. As examples, the scatter plots (bubble chart) of the assessments for the wards "Back Anamnesis" and "Physical Examination Neurology" are shown in figure 6. An overall analysis based on the Generalizability theory of all data (student examiners and supervisors) with the examiner group as a fixed factor and with separate variance components for the two examiner groups is shown in table 8. The supervisors give 0.568 points less than the student examiners, but the effect is not significant (p=0.152). The examiner effects have a standard deviation of 0.700 points (see also

Discussion
The results show that the stations of the OSCE General Medicine 2018 essentially fulfill the same quality criteria as the stations that are tested in the OSCEs in the subjects of Surgery and Internal Medicine, which have been established for years. In two of the physical examination stations, a review is recommended due to low selectivity. The matching of the assessments of the student examiners with those of the supervisors can be described as good to very good at all stations. Systematic differences between the assessments of the student examiners and the supervisors cannot be demonstrated. Although there is a relative influence of the examiners, the examiner effects tend to be even lower than in the comparison OSCEs.
The generalizability standardized on ten stations is noticeably higher in the OSCE General Medicine with Eρ 2 =0.82 compared to the two studies mentioned in Khan's review [21], in which an analysis was carried out using the Generalizability theory, in [10] and marginally higher in [22] (Eρ 2 =0.51 for the checklist and Eρ 2 =0.63 for the "global score" and Eρ 2 =0.80 for the "total score"). Apart from the number of stations, the measurement reliability of the OSCE examination in General Medicine is fully in line with the summative comparative OSCEs in 1. students instead of experts can be used as examiners of practical skills and 2. the quality of a formative examination with student examiners is similar to that of established summative OSCEs with experts as examiners.
Since the implementation of practical format-based exams, which record the level of knowledge for students themselves as well as for teachers in a structured manner, often fails at the faculties due to the availability of examiners from the teaching staff, students in higher semesters offer a convenient alternative to substitute them.
The only weakness of the OSCE General Medicine is the small number of four stations that the examined students have to pass through. However, the fact that four stations does not provide a measurement reliability that meets the requirements of high-quality examinations is not surprising. This is in line with the literature, which demands significantly higher numbers of stations for OSCEs in order to obtain overall evaluations that can be classified as meaningful [25]. The analysis of other formative examinations in which students act as examiners is of course desirable, since it is not possible to generalize to other institutions, general conditions, or the like from the individual case presented here. Such investigations could show which conditions must be met for the use of student examiners in order to obtain statistically satisfactory and meaningful performance assessments.

Limitations
The random second assessment by the supervisors were not carried out systematically, so that the comparisons with the student assessors are partly based on very small data sets (see table 7). There is also a room for improvement in the systematic allocation of the two physical examination stations from the set of eleven available stations among the examined students.

Conclusion
Overall, the OSCE General Medicine shows that it is possible to assess a large number of students with student examiners and thus to conduct high quality formative practical examinations. The involvement of students in the process of creating formative performance assessments is thus a practical way for medical faculties to take advantage of the widely recognized benefits of feedback in university teaching with the help of structured performance recording.